From text to numbers
2025-01-14
Bag-of-words featurization quantifies the frequency of words in text documents so they can be processed by machine learning models.
| n-gram Size | Description | n-gram Examples |
|---|---|---|
| Unigram (1-gram) | Single word | “Natural,” “language,” “processing” |
| Bigram (2-gram) | Sequence of two consecutive words | “Natural language,” “language processing” |
| Trigram (3-gram) | Sequence of three consecutive words | “Natural language processing” |
| 4-gram | Sequence of four consecutive words | “Natural language processing tasks” |
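The n-gram sizes above can be sketched with a minimal bag-of-words counter in plain Python. This is a toy sketch: tokenization here is just lowercasing and whitespace splitting, whereas real pipelines typically use a proper tokenizer (e.g., scikit-learn's `CountVectorizer`).

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) from a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bag_of_words(text, n=1):
    """Lowercase, split on whitespace, and count n-gram frequencies."""
    tokens = text.lower().split()
    return Counter(" ".join(gram) for gram in ngrams(tokens, n))

doc = "natural language processing tasks"
unigram_counts = bag_of_words(doc, n=1)  # counts single words
bigram_counts = bag_of_words(doc, n=2)   # counts two-word sequences
```

Setting `n=2` on the example sentence yields the bigrams from the table ("natural language", "language processing", and so on), each with count 1.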
Zipf's law: when we rank observations by their frequency, the frequency of a specific observation is inversely proportional to its rank.
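This rank–frequency relationship can be inspected on any token list with a short sketch; the token sequence below is a made-up example, not real corpus data.

```python
from collections import Counter

def rank_frequencies(tokens):
    """Rank distinct tokens by frequency (rank 1 = most frequent)."""
    counts = Counter(tokens)
    return [(rank, token, freq)
            for rank, (token, freq) in enumerate(counts.most_common(), start=1)]

# Under Zipf's law, the frequency at rank r is roughly (frequency at rank 1) / r,
# so rank * frequency stays roughly constant across ranks in a large corpus.
tokens = "the cat sat on the mat the cat sat the".split()
ranking = rank_frequencies(tokens)  # [(1, "the", 4), (2, "cat", 2), ...]
```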
| Aspect | Lemmatization | Stemming |
|---|---|---|
| Process | Context-aware, dictionary-based | Rule-based, heuristic chopping |
| Output | Produces valid dictionary words | May produce non-words (e.g., “running” → “runn”) |
| Context | Considers grammatical structure | Ignores grammatical context |
| Accuracy | High (linguistically accurate) | Lower (prone to errors) |
| Speed | Slower due to dictionary lookup | Faster because it’s rule-based |
| Examples | “running” → “run”, “better” → “good” | “running” → “run”, “better” → “better” |
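The contrast in the table can be illustrated with toy versions of both. The suffix list and the lemma dictionary below are illustrative stand-ins, not real tools (in practice one would use, e.g., NLTK's PorterStemmer and WordNetLemmatizer).

```python
def crude_stem(word):
    """Heuristic stemmer: blindly chop common suffixes (may yield non-words)."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

# Tiny stand-in for a dictionary; real lemmatizers use a lexicon plus POS tags.
LEMMA_DICT = {"running": "run", "ran": "run", "better": "good"}

def lemmatize(word):
    """Dictionary-based lemmatizer: look up the canonical form."""
    return LEMMA_DICT.get(word, word)

crude_stem("running")   # "runn" — fast, but not a valid word
lemmatize("running")    # "run"  — valid dictionary word
lemmatize("better")     # "good" — context-aware mapping the stemmer misses
```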
Term-document matrix (TDM): a matrix where rows represent terms (e.g., words, phrases, or n-grams) and columns represent documents.
Document-term matrix (DTM): a matrix where rows represent documents and columns represent terms (e.g., words, phrases, or n-grams).
Each cell contains a value that represents the occurrence or importance of the term in the document.
A DTM can be used for the same purposes as a TDM, since one is the transpose of the other.
Both become very sparse when the vocabulary is large.
| Term | Doc1 | Doc2 | Doc3 |
|---|---|---|---|
| climate | 5 | 2 | 0 |
| president | 3 | 0 | 1 |
| immigration | 2 | 4 | 6 |
| economy | 1 | 1 | 3 |
| Document | climate | president | immigration | economy |
|---|---|---|---|---|
| Doc1 | 5 | 3 | 2 | 1 |
| Doc2 | 2 | 0 | 4 | 1 |
| Doc3 | 0 | 1 | 6 | 3 |
| Feature | Term-Document Matrix (TDM) | Document-Term Matrix (DTM) |
|---|---|---|
| Rows | Terms (words or phrases) | Documents |
| Columns | Documents | Terms (words or phrases) |
| Primary Use | Explore term trends across documents | Explore document trends across terms |
| Transposition | Can be transposed to form a DTM | Can be transposed to form a TDM |
| Applications | Topic modeling, text mining | Text classification, semantic analysis |
| Example Structure | Row = “climate,” Column = “Doc1,” Value = 5 | Row = “Doc1,” Column = “climate,” Value = 5 |
| Dimensionality | Wide structure for large vocabularies | Tall structure for large document collections |
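The transposition relationship from the comparison table can be shown directly with the example matrices above; a sketch in plain Python:

```python
# Document-term matrix from the example: rows = documents, columns = terms.
terms = ["climate", "president", "immigration", "economy"]
docs = ["Doc1", "Doc2", "Doc3"]
dtm = [
    [5, 3, 2, 1],  # Doc1
    [2, 0, 4, 1],  # Doc2
    [0, 1, 6, 3],  # Doc3
]

# Transposing the DTM yields the TDM (rows = terms, columns = documents).
tdm = [list(col) for col in zip(*dtm)]
# tdm[0] is the "climate" row across Doc1..Doc3: [5, 2, 0]
```

Transposing again recovers the original DTM, which is why the two forms carry the same information and differ only in orientation.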
High tf-idf: the term is frequent in the document but rare in the overall corpus, so it is more unique or important for that document.
Low tf-idf: the term is either frequent across the corpus (e.g., stopwords like “the,” “and”) or infrequent in the document, so it has less distinguishing power.
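A minimal tf-idf sketch. This uses one common formulation (count-based tf normalized by document length, and idf = log(N / df)); libraries such as scikit-learn use smoothed variants, so exact weights differ.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute tf-idf weights for each tokenized document.

    tf  = count of the term in the document / document length
    idf = log(N / number of documents containing the term)
    """
    n = len(docs)
    df = Counter()                 # document frequency of each term
    for doc in docs:
        df.update(set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n / df[term])
            for term, count in counts.items()
        })
    return weights

docs = [
    "climate policy climate debate".split(),
    "economy policy".split(),
    "immigration debate economy".split(),
]
w = tf_idf(docs)
# "climate" appears only in doc 0, so it gets a high weight there;
# "policy" appears in two of the three docs, so its weight is lower.
```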
Uses of tf-idf vectors:
- Input for machine learning models
- Information retrieval
- Anomaly detection
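For information retrieval, a common approach is to score documents against a query by the cosine similarity of their term vectors. A sketch below, reusing the example DTM rows as (unweighted) document vectors; the query vector is a made-up example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Toy retrieval: vectors over the terms (climate, president, immigration, economy).
doc_vectors = {"Doc1": [5, 3, 2, 1], "Doc2": [2, 0, 4, 1], "Doc3": [0, 1, 6, 3]}
query = [1.0, 0.0, 0.5, 0.0]  # a query about climate and immigration

ranked = sorted(doc_vectors,
                key=lambda d: cosine(query, doc_vectors[d]),
                reverse=True)
```

In practice the rows would be tf-idf weighted rather than raw counts, which downweights terms that appear in every document before scoring.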
Scaling: quantify and compare the positions of documents (e.g., texts, articles, speeches) along one or more dimensions based on their linguistic content.
An uninformed scaling application I built a couple of years ago.